Coding Best Practice

Dr Siân Bladon, PhD

What We Will Cover in This Session


  • Why this is important/useful
  • Readable code
  • Naming things
  • Organising your projects
  • Version control


Disclaimer - some of it is quite specific to coding in R

Why This is Important


Good coding practices are important because they help ensure that your code is:

  • readable - is the code written in a clear and readable way?

  • understandable - is the code easy to follow? Is it clear what is being done at each stage?

  • reproducible - would another person be able to re-run the code?

Why This is Important


This is useful for working on any type of coding project.
For solo projects, you may think your code makes perfect sense when you are writing it, but when you look back it does not…
When working with others, it is important that it makes sense not only to you but also to your collaborators.

Why This is Important


Introducing some of these practices into your workflow at the start of projects can help them run more smoothly, avoid potential confusion and save time further down the line.


There is a move for science to become more reproducible, with more researchers making their code and data available when publishing a paper.
This means your code may be viewed by a wider range of people, so it is even more important that the code is reproducible.

Resources


DataCamp tutorial


R4DS Sections 2, 4 and 6


tidyverse style guide


Google Style Guides - including R and Python

Readable Code


There are a few things you can do to make your code easier to read

youshouldavoidwritingcodelikethisasitisdifficulttoreadthesentence

Readable Code

thisisabadnameforafunction <- function()


Do not use spaces in names

this is also a bad name for a function <- function()

Readable Code

For naming variables and functions there are two common conventions you can use:

  • camelCase e.g. thisIsABetterNameForAFunction <- function()

  • snake_case e.g. this_is_an_even_better_name_for_a_function <- function()


Personal preference but be consistent

Readable Code


The janitor package in R has useful functions for cleaning variable names

# A tibble: 6 × 5
  `row ID` `Organisation Name` `Patient Age` `LENGTH OF STAY` Death.flag
     <dbl> <chr>                       <dbl>            <dbl>      <dbl>
1        1 Trust1                         55                2          0
2        2 Trust2                         27                1          0
3        3 Trust3                         93               12          0
4        4 Trust4                         45                3          1
5        5 Trust5                         70               11          0
6        6 Trust6                         60                7          0

Readable Code


The janitor package in R has useful functions for cleaning variable names

library(janitor)

data <- clean_names(data)

head(data, n = 4)
# A tibble: 4 × 5
  row_id organisation_name patient_age length_of_stay death_flag
   <dbl> <chr>                   <dbl>          <dbl>      <dbl>
1      1 Trust1                     55              2          0
2      2 Trust2                     27              1          0
3      3 Trust3                     93             12          0
4      4 Trust4                     45              3          1


Default is to use snake_case

Readable Code


The janitor package in R has useful functions for cleaning variable names

data <- clean_names(data, case = "upper_camel")

head(data, n = 4)
# A tibble: 4 × 5
  RowId OrganisationName PatientAge LengthOfStay DeathFlag
  <dbl> <chr>                 <dbl>        <dbl>     <dbl>
1     1 Trust1                   55            2         0
2     2 Trust2                   27            1         0
3     3 Trust3                   93           12         0
4     4 Trust4                   45            3         1


But you can specify to use CamelCase if you prefer

Readable Code


Use spaces in lines of code to separate names, functions and operators

# bad

data$patient_age_grp<-if_else(data$patient_age<=55,0,1)

trust1<-data[data$organisation_name=="Trust1",]

# better

data$patient_age_grp <- if_else(data$patient_age <= 55, 0, 1)

trust1 <- data[data$organisation_name == "Trust1", ]

Readable Code


Note the exception: do not put a space between a function name and its opening bracket

# like this 
data$patient_age_grp <- if_else(data$patient_age <= 55, 0, 1)

# not like this

data$patient_age_grp <- if_else (data$patient_age <= 55, 0, 1)

Readable Code


Avoid lines that are too long

# bad

ggplot(data) +
  geom_point(aes(x = patient_age, y = length_of_stay, colour = as.factor(death_flag))) +
  theme_minimal() +
  labs(title = "Age and length of stay of patients at 10 hospital trusts", x = "Patient Age (years)", y = "Patient Length of Stay (Days)")

# better

ggplot(data) +
  geom_point(aes(x = patient_age, 
                 y = length_of_stay, 
                 colour = as.factor(death_flag))) +
  theme_minimal() +
  labs(title = "Age and length of stay of patients at 10 hospital trusts", 
       x = "Patient Age (years)", 
       y = "Patient Length of Stay (Days)")

Readable Code


If using the tidyverse or ggplot2 then start a new line after each %>% or +

data %>%
  filter(organisation_name == "Trust1") %>%
  ggplot(aes(x = patient_age, 
             y = length_of_stay, 
             colour = as.factor(death_flag))) +
  geom_point() +
  theme_minimal()

Readable Code


Use functions to avoid repeating lines of code
A general rule of thumb is that if you copy and paste a section of code more than twice, you should turn it into a function


Chapter 25 of the R4DS book is a useful place to start
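
As a minimal sketch of that rule of thumb, the example below replaces a calculation that would otherwise be copied for each column with one small function. The data frame `patients` and the function name `rescale_01` are assumptions made for illustration; the values are the first four rows of the example data shown earlier.

```r
# Illustrative data frame (hypothetical name), using four rows of the example data
patients <- data.frame(
  patient_age    = c(55, 27, 93, 45),
  length_of_stay = c(2, 1, 12, 3)
)

# Without a function, this calculation would be copied and edited per column:
# (patients$patient_age - min(patients$patient_age)) /
#   (max(patients$patient_age) - min(patients$patient_age))

# Rescale a numeric vector so that it runs from 0 to 1
rescale_01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# The same logic is now written once and reused
patients$age_scaled <- rescale_01(patients$patient_age)
patients$los_scaled <- rescale_01(patients$length_of_stay)
```

If the calculation needs to change, it only has to be changed in one place.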

Readable Code


Example of a simple function

function_example <- function(x, y) {
  (x + y) / 2
}

function_example(7, 15)
[1] 11

Readable Code


Slightly more complex example

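mean_function <- function(org_name, var) {
  
  # {{ var }} evaluates the column within the filtered rows,
  # so the mean is calculated for the selected trust only
  data %>%
    filter(organisation_name == org_name) %>%
    summarise(mean_var = mean({{ var }}))
  
} 

# Pass the bare column name rather than data$column
mean_function("Trust1", patient_age)
# A tibble: 1 × 1
  mean_var
     <dbl>
1     55.4
mean_function("Trust2", length_of_stay)
# A tibble: 1 × 1
  mean_var
     <dbl>
1     4.23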

Readable Code


You can use the purrr package to apply a function to each item in a list
In base R, the lapply(), sapply() and vapply() functions do something similar

plot_function <- function(org_name) {
  
  age_los_plot <- data %>%
    filter(organisation_name == org_name) %>%
    ggplot(aes(x = patient_age, 
               y = length_of_stay, 
               colour = as.factor(death_flag))) +
    geom_point() +
    theme_minimal() +
    labs(x = "Patient Age (Years)",
         y = "Length of Stay (Days)")
  
  age_los_plot

}

orgs_list <- list("Trust1", "Trust2", "Trust3")

purrr::map(orgs_list, plot_function)
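
For comparison, here is a small sketch of the base R alternatives. The list `trust_los` and its values are made up for illustration and are not taken from the example dataset.

```r
# Hypothetical lengths of stay (days) for three trusts
trust_los <- list(
  Trust1 = c(2, 5, 7),
  Trust2 = c(1, 4),
  Trust3 = c(12, 3, 6)
)

# lapply() always returns a list, one element per input item
mean_los_list <- lapply(trust_los, mean)

# sapply() simplifies the result where it can, here to a named numeric vector
mean_los_vec <- sapply(trust_los, mean)

# vapply() also simplifies, but you state the expected type and length of
# each result, so it fails loudly if a result is not a single number
mean_los_checked <- vapply(trust_los, mean, FUN.VALUE = numeric(1))
```

vapply() is often the safer choice inside scripts and functions because the type of the output is guaranteed.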

Readable Code

Use comments to annotate your code so it is easier to follow.
Particularly for documenting WHY you have done something

# Anything preceded by a # will not be executed by R

# 10*15

10*20
[1] 200

Readable Code

Use comments to annotate your code so it is easier to follow.
Particularly for documenting WHY you have done something

# At the start of an R script I usually write a few lines describing what the script is for, 
# what the input data is and what the expected outputs are. 
# Then use comments throughout to break up the code and explain the analysis. Example:

# 15 patients in dataset with their age missing, excluding them from analysis

data <- data %>%
  filter(!is.na(patient_age))

# n = 285 from here

Naming Things


When you are naming new variables choose names that are descriptive.
Do not duplicate names

# bad

data %>%
  group_by(organisation_name) %>%
  summarise(n = sum(death_flag),
            mean_1 = mean(patient_age),
            mean_2 = mean(length_of_stay)) %>%
  head(n = 4)
# A tibble: 4 × 4
  organisation_name     n mean_1 mean_2
  <chr>             <dbl>  <dbl>  <dbl>
1 Trust1                7   55.4   5.07
2 Trust10               4   51.0   4.3 
3 Trust2                5   51.2   4.23
4 Trust3                6   47.9   5.07

Naming Things


When you are naming new variables or functions choose names that are descriptive.
Do not duplicate names

# better

data %>%
  group_by(organisation_name) %>%
  summarise(n_deaths = sum(death_flag),
            mean_patient_age = mean(patient_age),
            mean_length_of_stay = mean(length_of_stay)) %>%
  head(n = 4)
# A tibble: 4 × 4
  organisation_name n_deaths mean_patient_age mean_length_of_stay
  <chr>                <dbl>            <dbl>               <dbl>
1 Trust1                   7             55.4                5.07
2 Trust10                  4             51.0                4.3 
3 Trust2                   5             51.2                4.23
4 Trust3                   6             47.9                5.07

Naming Things


When you are naming new variables or functions choose names that are descriptive.
Do not duplicate names

# bad

model_a <- glm(data$length_of_stay ~ data$patient_age, family = gaussian())

model_b <- glm(as.factor(data$death_flag) ~ data$patient_age, family = binomial())

# better

model_los_age <- glm(data$length_of_stay ~ data$patient_age, 
                     family = gaussian())

model_death_age <- glm(as.factor(data$death_flag) ~ data$patient_age, 
                       family = binomial())

Naming Things


For naming files again use descriptive names. If working on a larger project then consider having a separate file for each stage of the project, and make it clear what order the analysis has been done in.

For example:
01_data_cleaning.R
02_baseline_characteristics.R
03_descriptive_stats.R
04_models.R
05_figures.R
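
One reason to zero-pad the numeric prefixes, sketched below: file names are sorted as text, so without padding a tenth script would be listed before the first. The file name `10_appendix.R` is hypothetical and only there to show the effect; `method = "radix"` is used so that the ordering does not depend on the locale of your computer.

```r
# Prefixes without zero-padding
unpadded <- c("1_data_cleaning.R",
              "2_baseline_characteristics.R",
              "10_appendix.R")

# Prefixes with zero-padding
padded <- c("01_data_cleaning.R",
            "02_baseline_characteristics.R",
            "10_appendix.R")

# Text sorting puts "10_" before "1_" and "2_"
sorted_unpadded <- sort(unpadded, method = "radix")

# With zero-padding the sorted order matches the order of the analysis
sorted_padded <- sort(padded, method = "radix")
```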

Organising Your Work


Within an R script you can use sections to organise your scripts.


In RStudio, insert a new section using Ctrl + Shift + R and navigate between sections using the document outline on the right of the script
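
A section is just a comment line that ends in four or more dashes, so you can also type one by hand. A minimal sketch, with made-up section names and values:

```r
# Load data ----------------------------------------------------------------

# Illustrative values only
los_days <- c(2, 1, 12, 3)

# Summarise length of stay -------------------------------------------------

mean_los <- mean(los_days)
```

Each of these comment lines appears as an entry in the document outline.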

Organising Your Work


Working within an R Project is a good way to organise not only your R scripts but also all the data and outputs from your work, keeping them in the same place.

It avoids the need to use setwd() at the start of your scripts, which is not best practice, particularly when collaborating with others.

Organising Your Work


setwd() is typically used with absolute file paths, e.g.

setwd("C:/Users/mfbx9sbk/OneDrive - The University of Manchester/MSc Teaching/coding_best_practice_2")


This can cause problems when you are collaborating with others, as not everyone will have their files organised in the same way.

Organising Your Work


R Projects use relative file paths, which are relative to the working directory of the project.

For example, you want to save a cleaned version of your data, or a plot you have generated.

Organising Your Work


Here the file paths are relative to the Project directory

ggplot(data) +
  geom_point(aes(x = patient_age, y = length_of_stay))
ggsave("figs/age_los_scatter.png")


write_csv(data, "data/trust_los_clean.csv")


So if you shared the project with another person then it would not matter where they saved the project, all the file paths would work.
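
Relative paths only work if the folders they point to exist, so it can help to create them in the script. The sketch below uses a temporary folder as a stand-in for the project directory so that it can be run anywhere; `project_root` and `cleaned` are names assumed for illustration.

```r
# Stand-in for the project directory (in a real R Project this is the
# working directory, so "data/trust_los_clean.csv" alone would be enough)
project_root <- file.path(tempdir(), "example_project")

# Create the data folder if it does not already exist
dir.create(file.path(project_root, "data"),
           recursive = TRUE, showWarnings = FALSE)

# Illustrative cleaned data
cleaned <- data.frame(
  organisation_name = c("Trust1", "Trust2", "Trust3"),
  length_of_stay    = c(2, 1, 12)
)

# file.path() builds the path with the correct separators
out_path <- file.path(project_root, "data", "trust_los_clean.csv")
write.csv(cleaned, out_path, row.names = FALSE)

# Read the file back in to check it was saved
reloaded <- read.csv(out_path)
```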

Organising Your Work


To set up an R Project go to File -> New Project

Organising Your Work


Use the README.md document to briefly describe your project, including what you have done and what the output is.


Organising Your Work


Use the source() function to call and run other R scripts within your current script.

For example, you may want to automatically run your data cleaning script before your analysis script

source("scripts/01_data_cleaning.R")

Organising Your Work



Or, you may have a separate script with functions you want to use


source("scripts/00_functions.R")
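
To show what source() does, the sketch below writes a small functions script to a temporary file and then sources it. The temporary file stands in for something like scripts/00_functions.R, and `mean_of_two` is a hypothetical function name.

```r
# Write a small script containing one function to a temporary file
functions_script <- tempfile(fileext = ".R")
writeLines(
  c("mean_of_two <- function(x, y) {",
    "  (x + y) / 2",
    "}"),
  functions_script
)

# source() runs every line of the script, so the function is now available
source(functions_script)

result <- mean_of_two(7, 15)
```

After sourcing, anything created by the script (functions, data, models) can be used in the current script.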

Organising Your Work


This can be useful for building Reproducible Analytical Pipelines, where analysis can be run repeatedly and reproducibly.


More info on that can be found here
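
A very simple version of this idea is a single script that runs each numbered script in order. This is only a sketch under assumptions: the scripts are written to a temporary folder so the example can be run as it is, and their contents are made up.

```r
# Stand-in for the scripts folder of a project
scripts_dir <- file.path(tempdir(), "rap_scripts")
dir.create(scripts_dir, showWarnings = FALSE)

# Two tiny illustrative scripts; the second relies on the first
writeLines("cleaned_los <- c(2, 1, 12, 3)",
           file.path(scripts_dir, "01_data_cleaning.R"))
writeLines("mean_los <- mean(cleaned_los)",
           file.path(scripts_dir, "02_descriptive_stats.R"))

# list.files() returns the files in alphabetical order,
# which matches the numbered prefixes
pipeline_scripts <- list.files(scripts_dir,
                               pattern = "\\.R$",
                               full.names = TRUE)

# Run each stage in turn
for (script in pipeline_scripts) {
  source(script)
}
```

Re-running this one script repeats the whole analysis from start to finish.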

Version Control


If you have ever had a bunch of files that look something like this then you may want to consider using a version control system to manage your projects

Version Control


Using a version control system can:

  • help organise your work and keep track of updates and changes
  • make it easier to collaborate with others
  • create a repository that can be shared more widely when a project is complete
  • be difficult to navigate at first but quickly become integrated into your regular workflow

Version Control


The most widely used (in the data science community) software for version control is Git.
Git takes snapshots of all files in a project at a specific time - referred to as a “commit”.
It stores the initial version and any subsequent updated versions that are committed.
It tracks any changes you have made at each commit, which can be identified using the “diff” command

Version Control


GitHub is a complementary hosting platform for your repositories (others are available).
Once updates have been committed to Git they can be “pushed” to GitHub.
Collaborators can then “clone” the repository (or “fork” their own copy of it on GitHub) and work on it locally whilst you are also still working on it, sharing changes by pushing and pulling commits to and from GitHub.

Version Control


What a repository looks like on GitHub

Version Control


I would recommend reading this article which explains in more detail about how to use Git and GitHub.

Version Control


Git can be integrated into RStudio and therefore more easily be incorporated into your workflow.
Once installed, an additional tab will appear in the environment pane, where you can commit and push files.

Version Control


Or you can go into the RStudio terminal tab and type Git commands from there

Version Control


To install Git and connect it to GitHub and RStudio, follow this tutorial by Jenny Bryan. It talks through each setup step and how to use basic Git commands.